The graphs below don’t have proper titles, axis labels,
legends, etc. Please take care to do this on your own graphs.
Throughout this demo we will use the palmerpenguins
dataset. To access the data, you will need to install the
palmerpenguins package:
install.packages("palmerpenguins")
We load the penguins data in the same way as the
previous demos:
library(tidyverse)
library(palmerpenguins)
data(penguins)
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
GGallyWe will use the GGally
package to make pairs plots in R with ggplot
figures. You need to install the package:
install.packages("GGally")
Next, we’ll load the package and create a pairs plot of just the
continuous variables using ggpairs. The main arguments you
have to worry about for ggpairs are data,
columns, and mapping:
data: specifies the dataset
columns: Columns of data you want in the plot (can
specify with vector of column names or numbers referring to the column
indices)
mapping: aesthetics using aes(). Most
important one is
aes(color = <variable name>)
First, let’s create a pairs plot by specifying columns
as the four columns of continuous variables (columns 3 through 6):
library(GGally)
penguins %>% ggpairs(columns = 3:6)
Obviously this suffers from over-plotting so we’ll want to adjust the
alpha. An annoying thing is that we specify the
alpha directionly with aes when using
ggpairs:
penguins %>% ggpairs(columns = 3:6, mapping = aes(alpha = 0.5))
Plots along the diagonal show marginal distributions. Plots along the off-diagonal show joint (pairwise) distributions or statistical summaries (e.g., correlation) to avoid redundancy. The matrix of plots is symmetric; e.g., entry (1,2) shows the same distribution as entry (2,1). However, entry (1,2) and entry (2,1) display different bits of information (or alternative plots) about the same distribution.
We could also specify categorical variables in the plot. We also
don’t need to specify column indices if we just select
which columns to use beforehand:
penguins %>%
dplyr::select(bill_length_mm, body_mass_g, species, island) %>%
ggpairs(mapping = aes(alpha = 0.5))
Alternatively, we can use the mapping argument to display these categorical variables in a different manner - and arguably more efficiently:
penguins %>%
ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
mapping = aes(alpha = 0.5, color = species))
The ggpairs function in GGally is very
flexible and customizable with regards to which figures are displayed in
the various panels. I encourage
you to check out the vignettes and demos on the package website for more
examples. For instance, in the pairs plot below I decide to display
the regression lines and make other adjustments to the off-diagonal
figures:
penguins %>%
ggpairs(columns = c("bill_length_mm", "body_mass_g", "island"),
mapping = aes(alpha = 0.5, color = species),
lower = list(
continuous = "smooth_lm",
combo = "facetdensitystrip"
),
upper = list(
continuous = "cor",
combo = "facethist"
)
)
You can also proceed to customize the pairs plot in the same manner
as ggplot figures:
penguins %>%
dplyr::select(species, body_mass_g, ends_with("_mm")) %>%
ggpairs(mapping = aes(color = species, alpha = 0.5),
columns = c("flipper_length_mm", "body_mass_g",
"bill_length_mm", "bill_depth_mm")) +
scale_colour_manual(values = c("darkorange","purple","cyan4")) +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
theme_bw() +
theme(strip.text = element_text(size = 7))
ggcorrplotWe can visualize the correlation matrix for the variables in a
dataset using the ggcorrplot
package. You need to install the package:
install.packages("ggcorrplot")
Next, we’ll load the package and create a correlogram using only the continuous variables. To do this, we first need to compute the correlation matrix for these variables:
penguins_cor_matrix <- penguins %>%
dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
cor(use = "complete.obs")
penguins_cor_matrix
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2350529 0.6561813 0.5951098
## bill_depth_mm -0.2350529 1.0000000 -0.5838512 -0.4719156
## flipper_length_mm 0.6561813 -0.5838512 1.0000000 0.8712018
## body_mass_g 0.5951098 -0.4719156 0.8712018 1.0000000
NOTE: Since there are missing values in the
penguins data we need to indicate in the cor()
function how to handle missing values using the use
argument. By default, the correlations are returned as NA,
which is not what we want. Instead, we can change this to only use
observations without NA values for the considered columns
(see help(cor) for more options).
Now, we can create the correlogram using ggcorrplot()
using this correlation matrix:
library(ggcorrplot)
ggcorrplot(penguins_cor_matrix)
There are several ways we can improve this correlogram:
type input: the default is full, we can
make it lower or upper instead:ggcorrplot(penguins_cor_matrix, type = "lower")
hc.order = TRUE:ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE)
lab = TRUE - but we should
round the correlation values first using the round()
function:ggcorrplot(round(penguins_cor_matrix, digits = 4),
type = "lower", hc.order = TRUE, lab = TRUE)
method input to circle so that the
size of the displayed circles is mapped to the absolute value of the
correlation value:ggcorrplot(penguins_cor_matrix, type = "lower", hc.order = TRUE,
method = "circle")
You can ignore the Warning message that is displayed -
just from the differences in ggplot implementation.
GGallyIn a parallel coordinates plot, we create an axis for each varaible and align these axes side-by-side, drawing lines between observations from one axis to the next. This can be useful for visualizing structure among both the variables and observations in our dataset. These are useful when working with a moderate number of observations and variables - but can be overwhelming with too many.
We use the ggparcoord() function from the GGally
package to make parallel coordinates plots:
penguins %>%
ggparcoord(columns = 3:6)
There are several ways we can modify this parallel coordinates plot:
alphaLines input to help handle
overlap:penguins %>%
ggparcoord(columns = 3:6, alphaLines = .2)
penguins %>%
ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species")
scale input, which by default is std that is
simply subtracting the mean and dividing by the standard deviation. We
could instead use uniminmax so that minimum of the variable
is zero and the maximum is one:penguins %>%
ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
scale = "uniminmax")
order input (see help(ggparcoord) for
details). There appears to be some weird errors however with the
different options, but you can still manually provide the order of
indices as follows:penguins %>%
ggparcoord(columns = 3:6, alphaLines = .2, groupColumn = "species",
order = c(6, 5, 3, 4))